Analyzing my Spotify Streaming History#

Author: Noah Stemen

Course Project, UC Irvine, Math 10, Summer 2023

Introduction#

The goal of this project is to explore aspects of my personal streaming history as provided by Spotify and examine the unique variables provided by the “1 Million Tracks” dataset. After creating a new dataframe of only what is included in both, I will explore a few trends in the variables with visuals and use linear regression to fit a predictive line between two columns of data.

Creating & Filtering my “Streaming History” DataFrame#

import pandas as pd
#first, let's import all of our files and use "sh" for "streaming history"
sh1 = pd.read_json("StreamingHistory0.json") #10,000 rows
sh2 = pd.read_json("StreamingHistory1.json") #10,000 rows
sh3 = pd.read_json("StreamingHistory2.json") #10,000 rows
sh4 = pd.read_json("StreamingHistory3.json") #10,000 rows
sh5 = pd.read_json("StreamingHistory4.json") #1,650 rows
#next, because each file covers my streaming history during a different period of time, let's combine them
#(DataFrame.append was removed in pandas 2.0, so pd.concat is the reliable way to stack them)
sh = pd.concat([sh1, sh2, sh3, sh4, sh5])
#to check it is properly combined, let's check the shape of this new dataframe - it should be 41,650 rows
sh.shape
(41650, 4)
sh
#pulling this up to get the column names
endTime artistName trackName msPlayed
0 2022-09-08 00:35 Jon Bellion All Time Low 119820
1 2022-09-08 00:35 Jon Bellion Eyes To The Sky 1560
2 2022-09-08 00:38 Lawrence False Alarms (with Jon Bellion) 84710
3 2022-09-08 00:38 Jon Bellion While You Count Sheep 1250
4 2022-09-08 00:39 Jon Bellion Blu 58630
... ... ... ... ...
1645 2023-09-08 23:37 Hozier De Selby (Part 1) 3940
1646 2023-09-08 23:37 half•alive Nobody - Live 220990
1647 2023-09-08 23:37 half•alive What's Wrong 36933
1648 2023-09-08 23:45 Hozier Unknown / Nth 280106
1649 2023-09-08 23:58 Hozier First Light 292080

41650 rows Ă— 4 columns
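As an aside, the five `read_json` calls could be collapsed into a loop over a glob pattern, which also scales if Spotify ever exports more files. A sketch (the `load_streaming_history` helper is my own invention, not part of the original notebook):

```python
import glob
import pandas as pd

def load_streaming_history(pattern="StreamingHistory*.json"):
    """Read every matching Spotify export file and stack them into one DataFrame."""
    # sorted() keeps StreamingHistory0 ... StreamingHistory4 in chronological order
    files = sorted(glob.glob(pattern))
    return pd.concat((pd.read_json(f) for f in files), ignore_index=True)
```

Note `ignore_index=True` would renumber the rows 0 through 41,649, whereas the notebook above keeps each file's own 0-based index.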

Next, I want to filter out certain recorded tracks I don’t want included. Specifically, I frequently listen to white noise and rain sounds on Spotify while getting work done, as a means to help me keep focus. Because of how many minutes of this would be recorded, and because it is irrelevant to an analysis of my taste in music, let’s remove it.

# first, let's find the full name of the white noise track
sh[sh['trackName'].str.contains('Noise')]
# "White Noise 2 Ho...", "White Noise 3 Ho...", and "Sleep Sounds Rai..." are all white noise tracks
# "Street Noise" & "Turn Off The Noise" are songs we will include in our work
endTime artistName trackName msPlayed
1367 2022-09-25 03:41 Erik Eriksson White Noise 2 Hour Long 4216
1383 2022-09-25 04:05 Erik Eriksson White Noise 2 Hour Long 560
5253 2022-11-01 18:33 Erik Eriksson White Noise 2 Hour Long 612690
5254 2022-11-01 18:33 Erik Eriksson White Noise 2 Hour Long 9600
5256 2022-11-01 19:42 Erik Eriksson White Noise 2 Hour Long 2822980
5257 2022-11-01 19:46 Erik Eriksson White Noise 2 Hour Long 255920
5258 2022-11-01 20:43 Erik Eriksson White Noise 2 Hour Long 1380030
5259 2022-11-01 21:21 Erik Eriksson White Noise 2 Hour Long 666990
5367 2022-11-02 04:23 Erik Eriksson White Noise 2 Hour Long 16362
1684 2023-01-10 18:39 Erik Eriksson White Noise 2 Hour Long 2326260
2009 2023-01-14 21:19 Erik Eriksson White Noise 3 Hour Long 826080
2831 2023-01-24 18:32 Erik Eriksson White Noise 3 Hour Long 5289087
2833 2023-01-24 19:00 Erik Eriksson White Noise 3 Hour Long 1660
3012 2023-01-26 06:28 Erik Eriksson White Noise 2 Hour Long 1443330
4503 2023-02-02 18:09 Thymes Street Noise 1045
4504 2023-02-02 18:09 Thymes Street Noise 30859
4505 2023-02-02 18:11 Thymes Street Noise 83412
4758 2023-02-03 19:31 Thymes Street Noise 114000
9490 2023-03-05 14:29 Relaxing White Noise Sleep Sounds Rain & Thunderstorm White Noise 8... 44864
467 2023-03-15 15:45 Erik Eriksson White Noise 2 Hour Long 28053
1121 2023-03-21 04:40 Erik Eriksson White Noise 2 Hour Long 7200649
1122 2023-03-21 05:36 Erik Eriksson White Noise 2 Hour Long 32340
1136 2023-03-21 21:45 Erik Eriksson White Noise 2 Hour Long 252670
1704 2023-03-26 11:52 Erik Eriksson White Noise 2 Hour Long 13888
4480 2023-04-21 04:35 Erik Eriksson White Noise 2 Hour Long 1696290
4481 2023-04-21 04:36 Erik Eriksson White Noise 2 Hour Long 1660
4510 2023-04-21 16:37 Erik Eriksson White Noise 2 Hour Long 4806880
4842 2023-04-24 06:01 Erik Eriksson White Noise 2 Hour Long 3530618
6083 2023-05-03 21:54 Erik Eriksson White Noise 3 Hour Long 5515784
7228 2023-05-10 18:22 Erik Eriksson White Noise 2 Hour Long 928
7427 2023-05-12 00:44 Erik Eriksson White Noise 2 Hour Long 2836290
7877 2023-05-15 06:50 Erik Eriksson White Noise 2 Hour Long 7200649
7880 2023-05-15 07:48 Erik Eriksson White Noise 2 Hour Long 450050
237 2023-06-01 05:56 Erik Eriksson White Noise 2 Hour Long 2410
238 2023-06-01 05:56 Erik Eriksson White Noise 2 Hour Long 1821721
429 2023-06-02 01:38 Erik Eriksson White Noise 2 Hour Long 2261210
1467 2023-06-12 06:26 Erik Eriksson White Noise 2 Hour Long 5634016
5827 2023-07-12 05:46 Peter McPoland Turn Off The Noise 231561
9019 2023-08-14 21:35 Erik Eriksson White Noise 2 Hour Long 4834050
760 2023-08-30 22:47 Erik Eriksson White Noise 2 Hour Long 6316280
1483 2023-09-06 22:55 Erik Eriksson White Noise 2 Hour Long 2640683
1484 2023-09-06 23:45 Erik Eriksson White Noise 2 Hour Long 1550920
sh = sh[~sh['trackName'].str.contains('White Noise|Sleep Sounds', case=False)]
# to test this worked, this should now only return the songs "Street Noise" and "Turn off the Noise"
sh[sh['trackName'].str.contains('Noise')]
endTime artistName trackName msPlayed
4503 2023-02-02 18:09 Thymes Street Noise 1045
4504 2023-02-02 18:09 Thymes Street Noise 30859
4505 2023-02-02 18:11 Thymes Street Noise 83412
4758 2023-02-03 19:31 Thymes Street Noise 114000
5827 2023-07-12 05:46 Peter McPoland Turn Off The Noise 231561

Now, let’s make sure there are no missing values in any of our columns or rows.

sh.isnull().sum()
# nope, we're all good to go :)
endTime       0
artistName    0
trackName     0
msPlayed      0
dtype: int64

We also need to convert the “endTime” column into usable dates instead of strings for later use. I will also create a new “minutesPlayed” column from the “msPlayed” column for the sake of legibility of units.

#sh is a filtered slice of the original dataframe, so assigning new columns to it
#raises a SettingWithCopyWarning - taking an explicit copy first avoids that
sh = sh.copy()
sh["endTime"] = pd.to_datetime(sh["endTime"])
sh["minutesPlayed"] = sh["msPlayed"]/60000

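The SettingWithCopyWarning that column assignments on a filtered slice can raise deserves a minimal illustration (toy data, not my streaming history): pandas cannot tell whether the slice or the original frame should receive the new column, and an explicit `.copy()` resolves the ambiguity.

```python
import pandas as pd

df = pd.DataFrame({"trackName": ["song", "White Noise 2 Hour Long"],
                   "msPlayed": [120000, 60000]})

# A filtered slice may be a view of df or a copy of it; assigning to it is
# ambiguous. Taking an explicit copy makes the intent clear and keeps the
# original frame untouched.
kept = df[~df["trackName"].str.contains("White Noise")].copy()
kept["minutesPlayed"] = kept["msPlayed"] / 60000
```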
Filtering the “1 Million Songs” DataFrame into a new sub-DataFrame#

#let's use "ms" for "million songs"
ms = pd.read_csv("spotify_data.csv")
ms.isnull().sum()
#for a dataframe of 1 million tracks, miraculously there are no missing values
Unnamed: 0          0
artist_name         0
track_name          0
track_id            0
popularity          0
year                0
genre               0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
time_signature      0
dtype: int64
df_ms = pd.merge(ms, sh, left_on=['track_name', 'artist_name'], right_on=['trackName', 'artistName'], how='inner')
#Now I do not need repeat columns of the same information (plus the stray "Unnamed: 0" index column)
df_ms = df_ms.drop(['artistName', 'trackName', 'Unnamed: 0'], axis=1)
df_ms
artist_name track_name track_id popularity year genre danceability energy key loudness ... acousticness instrumentalness liveness valence tempo duration_ms time_signature endTime msPlayed minutesPlayed
0 Jason Mraz I Won't Give Up 53QF56cjZA9RTuuMZDrSA6 68 2012 acoustic 0.483 0.303 4 -10.058 ... 0.69400 0.000 0.115 0.1390 133.406 240166 3 2023-03-01 20:05:00 36608 0.610133
1 Jason Mraz I Won't Give Up 53QF56cjZA9RTuuMZDrSA6 68 2012 acoustic 0.483 0.303 4 -10.058 ... 0.69400 0.000 0.115 0.1390 133.406 240166 3 2023-03-01 20:09:00 200447 3.340783
2 Jason Mraz I Won't Give Up 53QF56cjZA9RTuuMZDrSA6 68 2012 acoustic 0.483 0.303 4 -10.058 ... 0.69400 0.000 0.115 0.1390 133.406 240166 3 2023-03-01 23:54:00 6498 0.108300
3 Neon Trees Everybody Talks 2iUmqdfGZcHIhS3b9E9EWq 77 2012 alt-rock 0.471 0.924 8 -3.906 ... 0.00301 0.000 0.313 0.7250 154.961 177280 4 2022-09-10 06:39:00 177280 2.954667
4 Neon Trees Everybody Talks 2iUmqdfGZcHIhS3b9E9EWq 77 2012 alt-rock 0.471 0.924 8 -3.906 ... 0.00301 0.000 0.313 0.7250 154.961 177280 4 2022-09-12 04:58:00 177280 2.954667
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
27858 The Drums I Don't Know How To Love 2YvWonOJesvP0yu9IFJY7S 61 2011 rock 0.411 0.890 11 -6.062 ... 0.06170 0.127 0.227 0.0669 169.970 202054 4 2023-03-13 04:36:00 202053 3.367550
27859 The Drums Days 6113aOfHIC0vbZVDZ6PpRV 44 2011 rock 0.586 0.721 2 -7.743 ... 0.36900 0.160 0.141 0.6570 84.987 269082 4 2023-03-03 19:16:00 269081 4.484683
27860 The Drums Days 6113aOfHIC0vbZVDZ6PpRV 44 2011 rock 0.586 0.721 2 -7.743 ... 0.36900 0.160 0.141 0.6570 84.987 269082 4 2023-03-25 00:26:00 105290 1.754833
27861 The Drums Days 6113aOfHIC0vbZVDZ6PpRV 44 2011 rock 0.586 0.721 2 -7.743 ... 0.36900 0.160 0.141 0.6570 84.987 269082 4 2023-03-28 04:55:00 269081 4.484683
27862 The Drums Days 6113aOfHIC0vbZVDZ6PpRV 44 2011 rock 0.586 0.721 2 -7.743 ... 0.36900 0.160 0.141 0.6570 84.987 269082 4 2023-03-28 05:17:00 2070 0.034500

27863 rows Ă— 22 columns

I’d also like to briefly look at the start and end dates of when streams were recorded, both to see how long a span my data covers and to check whether filtering changed the earliest and most recent streams included.

earliest_date = sh['endTime'].min()
latest_date = sh['endTime'].max()
earliest_date_kept = df_ms['endTime'].min()
latest_date_kept = df_ms['endTime'].max()
earliest_date
Timestamp('2022-09-08 00:35:00')
earliest_date_kept
Timestamp('2022-09-08 00:35:00')
latest_date
Timestamp('2023-09-08 23:58:00')
latest_date_kept
Timestamp('2023-09-08 23:37:00')

Out of the 41,650 initial streams recorded in my streaming history, we are left with 27,863 streams. From a random sample of 1 million songs, to still retain 66% of what I have streamed is impressive. Unfortunately, that does mean the other 34% of what I have streamed will not be represented in the Altair charts or any work going forward. Also, I personally know that I downloaded Spotify back in 2021, so even the initial “sh” dataframe contained only streams going back one year and not my “entire” streaming history. With all of that being said, the rest of the project works with this sample of my overall streaming history.
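For completeness, `pd.merge` with `indicator=True` can label exactly which streams matched the catalog, which is how that 66% figure could be computed directly rather than from row counts. A sketch with made-up toy rows, not my real data:

```python
import pandas as pd

history = pd.DataFrame({"trackName": ["Stories", "Blu", "Days"],
                        "artistName": ["Hippo Campus", "Jon Bellion", "The Drums"]})
catalog = pd.DataFrame({"track_name": ["Stories", "Days"],
                        "artist_name": ["Hippo Campus", "The Drums"]})

# indicator=True adds a "_merge" column marking each row as "both",
# "left_only", or "right_only"
merged = pd.merge(history, catalog,
                  left_on=["trackName", "artistName"],
                  right_on=["track_name", "artist_name"],
                  how="left", indicator=True)
matched = (merged["_merge"] == "both").mean()  # fraction of streams retained
```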

Determining my Most Streamed Songs and Artists#

Next, instead of having 27,863 rows, one per stream, let’s combine the streams by song/artist and create two new columns: “total_times_streamed” and “total_minutes_streamed.”

most_songs = df_ms.groupby(['artist_name', 'track_name']).agg(
    total_times_streamed=('minutesPlayed', 'count'),
    total_minutes_streamed=('minutesPlayed', 'sum')
).reset_index()
most_songs
streams = pd.merge(most_songs, df_ms, left_on=['artist_name', 'track_name'], right_on=['artist_name', 'track_name'], how='inner')
#i no longer need the endTime, msPlayed, or minutesPlayed columns
#as they are keeping me from being able to drop any duplicates and are accounted for in
# the total number and total time listened
streams = streams.drop(["endTime", "msPlayed", "minutesPlayed"], axis=1)
streams = streams.drop_duplicates()
streams
artist_name track_name total_times_streamed total_minutes_streamed track_id popularity year genre danceability energy ... loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms time_signature
0 $uicideboy$ $uicideboy$ Were Better In 2015 1 2.389100 6LoaYlv0bC1TyctuADqNFh 66 2022 hip-hop 0.883 0.8220 ... -4.029 0 0.1080 0.0301 0.000002 0.1110 0.3270 110.024 143347 4
1 $uicideboy$ 1000 Blunts 1 2.924600 09riz9pAPJyYYDVynE5xxY 75 2022 hip-hop 0.830 0.6980 ... -6.517 0 0.0770 0.2240 0.000001 0.1910 0.5950 132.990 175476 4
2 $uicideboy$ Antarctica 2 0.030167 5UGAXwbA17bUC0K9uquGY2 77 2016 hip-hop 0.715 0.6330 ... -6.869 1 0.0804 0.5530 0.000004 0.0905 0.3190 105.945 126850 5
4 $uicideboy$ Avalon 1 0.266917 7CxFWAnQ8eqiRL4W12Xzb6 68 2021 hip-hop 0.877 0.6000 ... -4.577 1 0.0813 0.0210 0.000054 0.2440 0.1760 149.996 140859 4
5 $uicideboy$ For the Last Time 5 7.830050 240audWazVjwvwh7XwfSZE 74 2017 hip-hop 0.844 0.5330 ... -9.612 1 0.5520 0.0735 0.000003 0.0953 0.2300 140.078 156081 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
27855 soho At Peace 2 0.010433 7fJ1v1CninD1DsfNVbs4HU 34 2018 chill 0.809 0.3040 ... -12.764 1 0.2180 0.9630 0.898000 0.1080 0.5270 79.971 120000 4
27857 thuy girls like me don't cry 1 0.097000 2DtUUBwYwEzKMTMDrc5EiO 64 2022 chill 0.871 0.3720 ... -9.077 0 0.0413 0.2530 0.000002 0.1040 0.6080 110.011 214387 4
27858 thuy universe 1 0.091667 7B4UxdHwRKJYRhvXxmgZhM 62 2021 chill 0.636 0.4520 ... -8.298 1 0.0329 0.1360 0.000002 0.1040 0.0678 80.004 186627 4
27859 Ólafur Arnalds Saudade (When We Are Born) 3 7.500000 1ijwLR1iybtxaUbasUj7kJ 59 2021 ambient 0.289 0.0253 ... -31.435 1 0.0376 0.9940 0.919000 0.0837 0.1380 99.801 150000 4
27862 Ólafur Arnalds So Far 1 0.041017 6oVhL0lLUMswqSV3VcKwJO 50 2015 ambient 0.462 0.3390 ... -14.301 0 0.0418 0.8130 0.808000 0.1080 0.0395 115.015 272014 4

3742 rows Ă— 21 columns

Finally, I have the final form of my dataset, representing my streaming history, the time spent listening to each artist or song, and each unique attribute Spotify records for each song. To check that this final step was done correctly, the sum of “total_times_streamed” should equal the number of rows in df_ms (27,863).

streams["total_times_streamed"].sum()
27863

Now that we have this final form of my data, let’s determine my most streamed artists and most streamed songs that exist within both my streaming history and the dataset of 1 million songs.

# five most streamed songs
top_tracks = streams.groupby('track_name')['total_minutes_streamed'].sum()
# Sort the tracks by total_minutes_streamed in descending order and select the top 5
# and round it so there aren't any endless decimals
top_tracks.sort_values(ascending=False).head(5).round(3)
track_name
Stories                                       795.002
Stick Season                                  715.914
One More Time                                 666.362
Ode to a Conversation Stuck in Your Throat    645.537
All My Love                                   607.431
Name: total_minutes_streamed, dtype: float64
# five most streamed artists
top_artists = streams.groupby('artist_name')['total_minutes_streamed'].sum()
# Sort the tracks by total_minutes_streamed in descending order and select the top 5
# and round it so there aren't any endless decimals
top_artists.sort_values(ascending=False).head(5).round(3)
artist_name
Noah Kahan            5857.644
Paramore              3033.930
Taylor Swift          2595.682
Hippo Campus          1629.579
Tyler, The Creator    1564.350
Name: total_minutes_streamed, dtype: float64

According to my filtered “streams” dataset, my five most streamed songs of the past year are “Stories”, “Stick Season”, “One More Time”, “Ode to a Conversation Stuck in Your Throat”, and “All My Love”, and my five most streamed artists are Noah Kahan, Paramore, Taylor Swift, Hippo Campus, and Tyler, The Creator. While a large number of streams were lost in filtering against the “1 Million Songs” dataset, I can say with confidence that these results are very representative of my taste in music.

Visualizing My Taste in Music#

import altair as alt

This scatterplot shows the entirety of “streams” according to energy and popularity, with darker colors signifying where my most streamed songs lie among the rest.

alt.Chart(streams).mark_circle().encode(
    x='popularity:Q',
    y='energy:Q',
    color=alt.Color("total_minutes_streamed:Q", scale=alt.Scale(scheme="goldorange")),
    tooltip=["artist_name","track_name","total_minutes_streamed", "total_times_streamed","genre"],
)

Unfortunately, that means it is hard to see the rest of my overall taste in music. So, let’s exclude the most streamed songs (those above about 300 minutes) as outliers. I will also halve the number of songs so that I can interact with each point more accurately when hovering over it.

df_streams2 = streams[streams["total_minutes_streamed"] < 300]
df_streams = df_streams2.sample(frac=0.5, random_state=76) 

alt.Chart(df_streams).mark_circle().encode(
    x='popularity:Q',
    y='energy:Q',
    color=alt.Color("total_minutes_streamed:Q", scale=alt.Scale(scheme="goldorange")),
    tooltip=["artist_name","track_name","total_minutes_streamed", "total_times_streamed","genre"],
)
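Before reading impressions off the chart, the share of points in any popularity band can also be computed directly. A sketch with toy values standing in for the real `streams["popularity"]` column:

```python
import pandas as pd

# Toy popularity scores standing in for streams["popularity"]
popularity = pd.Series([35, 45, 52, 60, 68, 75, 81, 90])

# Series.between is inclusive on both endpoints by default
share = popularity.between(40, 80).mean()
print(f"{share:.0%} of tracks fall in the 40-80 popularity band")
```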

Now that I can see more of the higher end of “total minutes streamed”, I can see that the majority of my streams rest within 40 to 80 on the popularity scale but are far more spread out across energy. This tells me that while I am more likely to listen to songs that are “heard of but not too popular,” I will listen to a wide range of energy levels, while still leaning towards higher-energy music. Next, I want to see how release year plays into my taste in music.

yearly_minutes = streams.groupby('year')['total_minutes_streamed'].sum().round(2).reset_index()

chart = alt.Chart(yearly_minutes).mark_bar().encode(
    x='year:N',
    y='total_minutes_streamed:Q',
    tooltip=['year:N', 'total_minutes_streamed:Q']
)
chart

This suggests to me that there is a quirk of the dataset I did not anticipate, but now somewhat understand. The “year” variable certainly does not represent the year a song was originally released, as I had assumed. It may instead represent the year the song was added to (or last updated or remastered on) Spotify. Even that does not fully add up, since Spotify was created in 2006, yet the oldest year listed is 2000. But whatever “year” truly measures, if not the actual release year, this chart shows my preference for listening to music made within the last six or seven years. In the last chart, I would like to examine which recorded genres are listened to more than others.

streams_df_genre = streams.sort_values(by='total_minutes_streamed', ascending=False)

chart = alt.Chart(streams_df_genre).mark_bar().encode(
    x=alt.X('genre:N', sort='y'), 
    y='total_minutes_streamed:Q',
    tooltip=['genre:N', 'total_minutes_streamed:Q']
)
chart

While I was not expecting “electro” to be my most streamed genre, it does make sense that it is up there. The rest checks out: genres like pop, rock, and indie-pop are broad enough to cover plenty of the songs I listen to.

Machine Learning#

Can I use linear regression to predict a trend between two of the given variables? For example, within my selection of music, how do loudness and energy interact?

from sklearn.linear_model import LinearRegression

X = streams[['energy']]
y = streams['loudness']
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
streams['predictions'] = predictions

scatter_plot = alt.Chart(streams).mark_circle().encode(
    x='energy:Q',
    y='loudness:Q',
    tooltip=['energy:Q', 'loudness:Q'],
)

best_fit_line = alt.Chart(streams).mark_line(color='red').encode(
    x='energy:Q',
    y='predictions:Q',
)

combined_chart = scatter_plot + best_fit_line
combined_chart

I used scikit-learn to create a LinearRegression model, fit it to the data, and calculate predictions to add as a new “predictions” column to my “streams” dataframe. I then created an Altair scatterplot with my streams and a best fit line with the predictions, combining the two into one chart.

Unsurprisingly, there is a positive correlation between “loudness” and “energy.” What is interesting, however, is that a linear best-fit line cannot capture the “dropoff” in energy as loudness decreases. The true shape of the relationship between “loudness” and “energy” is closer to a square-root curve, so the best-fit line given fails to properly predict songs with loudness values below around -15.
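That square-root shape could actually be fit with the same LinearRegression class by transforming the feature first. A sketch on synthetic data (the coefficients 20 and -25 are invented for illustration, not estimated from my streams):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic energy values with loudness following a square-root curve plus noise
rng = np.random.default_rng(0)
energy = rng.uniform(0.01, 1.0, 200)
loudness = 20 * np.sqrt(energy) - 25 + rng.normal(0, 0.5, 200)

# Transform the feature, then fit an ordinary linear model:
# loudness ~ a * sqrt(energy) + b
X_sqrt = np.sqrt(energy).reshape(-1, 1)
model = LinearRegression().fit(X_sqrt, loudness)
```

The fitted `model.coef_` and `model.intercept_` should land near the true 20 and -25, showing that "linear" regression handles any curve that is linear in the transformed feature.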

Let’s see if we can make a better fitting line with Polynomial Regression (to the second degree).

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)  # You can change the degree as needed
X_poly = poly.fit_transform(X)
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)
y_pred = poly_reg.predict(X_poly)
streams['predictions_poly'] = y_pred
chart = alt.Chart(streams).mark_circle().encode(
    x='energy:Q',
    y='loudness:Q',
    tooltip=['energy:Q', 'loudness:Q']
)

# Overlay the polynomial regression line on the chart
regression_line = alt.Chart(streams).mark_line(color='red').encode(
    x='energy:Q',
    y='predictions_poly:Q'
)

chart + regression_line

As you can see, the change to a second-degree polynomial now better represents the curve made as both energy and loudness decrease. If I were to raise the degree any further, the added curvature would fight the true nature of the relationship between loudness and energy and would result in overfitting. The argument can also be made that the issue has now reversed: instead of the predictions being too high for loudness values below -15, the new polynomial regression curve fails to capture the more linear, upward relationship between the two around (energy = 0.2, loudness = -15).
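One way to make the overfitting argument concrete is to compare R² on held-out data as the degree grows. A sketch on synthetic square-root-shaped data, not my actual streams:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data shaped like the loudness/energy relationship (made-up constants)
rng = np.random.default_rng(1)
energy = rng.uniform(0.01, 1.0, 300).reshape(-1, 1)
loudness = (20 * np.sqrt(energy) - 25).ravel() + rng.normal(0, 1.0, 300)

X_train, X_test, y_train, y_test = train_test_split(
    energy, loudness, random_state=0)

scores = {}
for degree in (1, 2, 10):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    scores[degree] = r2_score(y_test, model.predict(poly.transform(X_test)))
# Test-set R-squared jumps from degree 1 to 2; a much higher degree tends to
# buy little on held-out data, which is the overfitting trade-off in action.
```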

Summary#

First, I created a dataframe of my own out of two pre-existing dataframes and determined certain maximums and minimums pertaining to my music. Then, I visualized different traits/columns recorded and their nature with one another. Finally, I used linear and polynomial regression to make a best-fit line predicting the relationship between two of those traits/columns.
